Mapping Crime In Chicago, IL

By Koya Saito
CMSE 201 Section 007, Dr. Finzell

Chicago Skyline.jpg | Creator: zrfphoto | Credit: Getty Images/iStockphoto | Copyright: zrfphoto

The mission of this project is to make tangible the overwhelming amount of crime in Chicago. While we have enough information to calculate statistics and mark trends, how do we wrap our heads around 2.5 million separate occurrences throughout the city? The aim of this project is to visualize crime in a way that adds information to the CSVs the data comes in: added in the sense that the maps make it intuitive to gain an understanding of the geography and to draw connections to other forms of geospatial data. In this project, crime data from the years 2008 through 2011 is mapped and displayed.

Also addressed in this project is the ethical responsibility of a data scientist to identify biases when interpreting data and to prevent misinformed assumptions from being drawn from analysis. While this project displays crime data throughout the entire city of Chicago, it falls short in the breadth of its analysis, considering only one variable, median household income, as an influencing factor in crime. As a follow-up to the mapping in this report, a brief section on the dangers of misusing statistics is included.

There are three sections to this project, each containing a different map.

  • General Crime Map of Chicago
  • Crime, Mapped by Census Tract
  • Mapping Socioeconomic Indicators (Hardship Index)

Using the mouse, hover over each map to see more information. Scroll to zoom and drag to pan, focusing on different parts of the city.

Code

In [1]:
# Data processing
import numpy as np
import pandas as pd
import json
from shapely.geometry import shape, Point

# Standard plotting
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Map plotting
import folium 
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

Loading Data

In [2]:
# Import data
chicago_crimes = pd.read_csv('data/Crime/Chicago_Crimes_2008_to_2011.csv', error_bad_lines=False) # Skips malformed rows, so our dataset may not be a complete representation (newer pandas versions use on_bad_lines='skip' instead)
income_data = pd.read_csv('data/Income/Total Income by Location.csv')

with open('data/Geo Boundaries/Boundaries - Census Tracts - 2010.geojson') as JsonBounds:
    census_geo_JSON = json.load(JsonBounds)
b'Skipping line 1149094: expected 23 fields, saw 41\n'
In [3]:
income_data['Formatted ID Geography'] = income_data['ID Geography'].str.slice(7) # Converting ID Geography column to same format as JSON data
income_data = income_data[income_data['Year'] == 2013].reset_index()
income_data.head()
Out[3]:
index ID Year Year ID Race Race Household Income by Race Household Income by Race Moe Geography ID Geography Formatted ID Geography
0 3999 2013 2013 0 Total 35591 6291.0 Census Tract 5804, Cook County, IL 14000US17031580400 17031580400
1 4000 2013 2013 0 Total 48047 4974.0 Census Tract 7305, Cook County, IL 14000US17031730500 17031730500
2 4001 2013 2013 0 Total 53142 8529.0 Census Tract 7304, Cook County, IL 14000US17031730400 17031730400
3 4002 2013 2013 0 Total 88922 27446.0 Census Tract 812.02, Cook County, IL 14000US17031081202 17031081202
4 4003 2013 2013 0 Total 66875 36159.0 Census Tract 814.01, Cook County, IL 14000US17031081401 17031081401
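The `str.slice(7)` call above strips the seven-character `14000US` summary-level prefix so the IDs match the `geoid10` field in the GeoJSON. In plain Python, the equivalent is:

```python
raw_id = "14000US17031580400"   # a value from the 'ID Geography' column
print(raw_id[7:])  # 17031580400
```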

Crime Map of Chicago

This page was a major contributing factor: https://plotly.com/python/mapbox-county-choropleth/

In [4]:
# Getting range of data
minIncome = min(income_data['Household Income by Race'])
maxIncome = max(income_data['Household Income by Race'])

# This function formats a value as currency
# (renamed from format() to avoid shadowing the Python builtin)
# https://stackoverflow.com/questions/35019156/pandas-format-column-as-currency
def format_currency(x):
    return "${:.1f}K".format(x/1000)

# Defining a column in the DataFrame for mouse hovering
income_data['text'] = 'Median household income: ' + income_data['Household Income by Race'].apply(format_currency).astype(str) + '<br>' + income_data['Geography'] 

Creating map of median income data

In [5]:
fig = go.Figure(go.Choroplethmapbox(geojson=census_geo_JSON, 
                                    featureidkey="properties.geoid10",
                                    locations=income_data["Formatted ID Geography"], 
                                    z=income_data['Household Income by Race'],
                                    text=income_data['text'],
                                    hoverinfo='text',
                                    colorscale="Blues", 
                                    marker_line_width=1,
                                    marker_line_color='white',
                                    marker_opacity=0.5,
                                    zmin=minIncome,
                                    zmax=maxIncome))
In [6]:
# fig.update_layout(mapbox_style="carto-positron", mapbox_zoom=9, mapbox_center = {"lat": 41.864073, "lon": -87.706819})
fig.update_layout(mapbox_style="light", 
                  mapbox_accesstoken='pk.eyJ1Ijoia295YXMiLCJhIjoiY2toenR4dGd6MHRpczMzbzJmYWVwcnBtNyJ9.JTZCO0J5FbFDj8OkCDIs5w',
                  mapbox_zoom=9, 
                  mapbox_center = {"lat": 41.864073, "lon": -87.706819},
                 )

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

Adding crime data to map

In [7]:
# Income data choropleth trace object
trace1 = go.Choroplethmapbox(geojson=census_geo_JSON, 
                                    featureidkey="properties.geoid10",
                                    locations=income_data["Formatted ID Geography"], 
                                    z=income_data['Household Income by Race'],
                                    text=income_data['text'],
                                    hoverinfo='text',
                                    colorscale="Blues", 
                                    marker_line_width=1,
                                    marker_line_color='white',
                                    marker_opacity=0.5,
                                    zmin=minIncome,
                                    zmax=maxIncome)

Note that we can adjust how many crimes are shown on this map. Analysis of this data could include splitting the crime data in different ways, including by type of crime, time committed, geography, and more.
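As a minimal sketch of that kind of splitting (toy rows standing in for the real chicago_crimes DataFrame, which has the same column names):

```python
import pandas as pd

# Toy rows standing in for the real chicago_crimes DataFrame
crimes = pd.DataFrame({
    'Primary Type': ['HOMICIDE', 'THEFT', 'THEFT', 'BATTERY'],
    'Year': [2008, 2008, 2011, 2010],
})

thefts = crimes[crimes['Primary Type'] == 'THEFT']   # split by type of crime
crimes_2008 = crimes[crimes['Year'] == 2008]         # split by year
print(len(thefts), len(crimes_2008))  # 2 2
```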

In [8]:
chicago_crime = chicago_crimes[:10000]

# Crime scattermapbox trace object
trace2 = go.Scattermapbox(
        lat=chicago_crime['Latitude'],
        lon=chicago_crime['Longitude'],
        mode='markers',
        marker=go.scattermapbox.Marker(
            size=3,
            color='rgb(242, 177, 172)',
            opacity=0.7
        ),
        hoverinfo='text',
    text=chicago_crime['Primary Type']
    )
In [9]:
fig = make_subplots()

fig.add_trace(trace1)
fig.add_trace(trace2)

fig.update_layout(mapbox_style="light", 
                  mapbox_accesstoken='pk.eyJ1Ijoia295YXMiLCJhIjoiY2toenR4dGd6MHRpczMzbzJmYWVwcnBtNyJ9.JTZCO0J5FbFDj8OkCDIs5w',
                  mapbox_zoom=8, 
                  mapbox_center = {"lat": 41.864073, "lon": -87.706819},
                 )

fig.show()

How Much Crime Happens in Each Census Tract?

In [ ]:
# Converting dictionary, previously used for Choropleth to DataFrame
census_geo_df = pd.DataFrame.from_dict(census_geo_JSON)
census_geo_df.head()

print(census_geo_df['features'][0])
In [11]:
# How many census tracts are we dealing with?
print(len(census_geo_df.index)) # According to our boundary data
print(len(income_data['Geography'].unique())) # According to our median income data

# Both sources agree on 801 tracts; some external references say 866, possibly because we're using 2010 boundary data
801
801
In [12]:
# What do our data points look like?
print(chicago_crimes['Location'].head() )

print('')
print(chicago_crimes['Longitude'].head())
print(chicago_crimes['Latitude'].head())
0    (41.758275857, -87.622451031)
1     (41.87025207, -87.746069362)
2    (41.770990476, -87.698901469)
3    (41.894916924, -87.757358147)
4    (41.843826272, -87.709893465)
Name: Location, dtype: object

0   -87.622451
1   -87.746069
2   -87.698901
3   -87.757358
4   -87.709893
Name: Longitude, dtype: float64
0    41.758276
1    41.870252
2    41.770990
3    41.894917
4    41.843826
Name: Latitude, dtype: float64

Defining a counting function

In [13]:
def locateCrimesPerTract(crimes, tracts):
    '''
    Creates columns (Census Tract Name and Census Tract geoID) for crime data
    indicating what census tract a crime happened in. NaN values are added for
    crimes where Lon/Lat data isn't available or the point falls outside all of
    the census tracts we're looking at.

    For checking if a point is in a polygon, references:
    https://stackoverflow.com/questions/20776205/point-in-polygon-with-geojson-in-python
    http://archived.mhermans.net/geojson-shapely-geocoding.html
    https://pypi.org/project/Shapely/
    '''
    census_tract_column = []
    census_geoID_column = []

    count = 0

    for index,row in crimes.iterrows():

        # Progress indicator (multiply by 100 so the figure reads as a percentage)
        if count % 100 == 0:
            print('Num crimes located: {:8d}   |   Percent done: {:5.2f}'.format(count, 100 * count / len(crimes.index)), end='\r')
        count += 1
        
        # Normally commented out, used for debugging. Prematurely stops loop at given number
#         if count >= 1555:
#             print('here')
#             break

        lon_untested = row['Longitude']
        lat_untested = row['Latitude']

        if np.isnan(lon_untested) or np.isnan(lat_untested):
            census_tract_column.append(np.nan)
            census_geoID_column.append(np.nan)
            continue
        else:
            lon = lon_untested
            lat = lat_untested
            point = Point(lon, lat)
            
            tractName = np.nan
            geoID = np.nan
            for feature in tracts['features']:
                polygon = shape(feature['geometry'])
                if polygon.contains(point):
                    tractName = feature['properties']['namelsad10']
                    geoID = feature['properties']['geoid10']
                    break
            census_tract_column.append(tractName)
            census_geoID_column.append(geoID)

    return(census_tract_column, census_geoID_column)
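Conceptually, `polygon.contains(point)` performs a point-in-polygon test. A pure-Python ray-casting sketch of the same idea (this is an illustration of the concept, not Shapely's actual implementation):

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting point-in-polygon test. polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count crossings of a horizontal ray extending right from (x, y);
        # an odd number of crossings means the point is inside
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(point_in_polygon(0.5, 0.5, square))  # True
print(point_in_polygon(1.5, 0.5, square))  # False
```

Shapely's `contains` additionally handles holes, boundary points, and other edge cases that this sketch ignores.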

Running the counting function

For each year, we run the counting function, checking which tract each crime happened in. These runs take a long time; I had to let them run overnight.

In [14]:
# Split the crime data by year before locating tracts (the raw data has a 'Year' column)
crimes2008 = chicago_crimes[chicago_crimes['Year'] == 2008]
crimes2009 = chicago_crimes[chicago_crimes['Year'] == 2009]
crimes2010 = chicago_crimes[chicago_crimes['Year'] == 2010]
crimes2011 = chicago_crimes[chicago_crimes['Year'] == 2011]

census_tract_column, census_geoID_column = locateCrimesPerTract(crimes2011, census_geo_df)
In [ ]:
print(len(census_tract_column))
print(len(crimes2011.index))
In [ ]:
crimes2011 = crimes2011.copy()
crimes2011['GeoID'] = census_geoID_column
crimes2011['Tract'] = census_tract_column
In [ ]:
crimes2011.to_csv('data/Crime/Processed Data/crimes2011.csv', index=False)
In [ ]:
tract_2008, geoID_2008 = locateCrimesPerTract(crimes2008, census_geo_df)
crimes2008 = crimes2008.copy()
crimes2008['GeoID'] = geoID_2008
crimes2008['Tract'] = tract_2008
crimes2008.to_csv('data/Crime/Processed Data/crimes2008.csv', index=False)
In [ ]:
tract_2009, geoID_2009 = locateCrimesPerTract(crimes2009, census_geo_df)
crimes2009 = crimes2009.copy()
crimes2009['GeoID'] = geoID_2009
crimes2009['Tract'] = tract_2009
crimes2009.to_csv('data/Crime/Processed Data/crimes2009.csv', index=False)
In [ ]:
tract_2010, geoID_2010 = locateCrimesPerTract(crimes2010, census_geo_df)
crimes2010 = crimes2010.copy()
crimes2010['GeoID'] = geoID_2010
crimes2010['Tract'] = tract_2010
crimes2010.to_csv('data/Crime/Processed Data/crimes2010.csv', index=False)

Mapping Crime By Census Tract

Combining all the data back into the same DataFrame

In [23]:
processed_crimes2008 = pd.read_csv('data/Crime/Processed Data/crimes2008.csv')
processed_crimes2009 = pd.read_csv('data/Crime/Processed Data/crimes2009.csv')
processed_crimes2010 = pd.read_csv('data/Crime/Processed Data/crimes2010.csv')
processed_crimes2011 = pd.read_csv('data/Crime/Processed Data/crimes2011.csv')

frames = [processed_crimes2008, processed_crimes2009, processed_crimes2010, processed_crimes2011]
processed_crimes = pd.concat(frames)
processed_crimes.head()
/Users/koya/opt/anaconda3/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3071: DtypeWarning:

Columns (4) have mixed types.Specify dtype option on import or set low_memory=False.

Out[23]:
ID Case Number Date Block IUCR Primary Type Description Location Description Arrest Domestic ... FBI Code X Coordinate Y Coordinate Year Updated On Latitude Longitude Location Tract GeoID
0 4785 HP610824 10/07/2008 12:39:00 PM 000XX E 75TH ST 0110 HOMICIDE FIRST DEGREE MURDER ALLEY True False ... 01A 1178207.0 1855308.0 2008 08/17/2015 03:03:40 PM 41.758276 -87.622451 (41.758275857, -87.622451031) Census Tract 6910 1.703169e+10
1 4786 HP616595 10/09/2008 03:30:00 AM 048XX W POLK ST 0110 HOMICIDE FIRST DEGREE MURDER STREET True False ... 01A 1144200.0 1895857.0 2008 08/17/2015 03:03:40 PM 41.870252 -87.746069 (41.87025207, -87.746069362) Census Tract 8314 1.703183e+10
2 4787 HP616904 10/09/2008 08:35:00 AM 030XX W MANN DR 0110 HOMICIDE FIRST DEGREE MURDER PARK PROPERTY False False ... 01A 1157314.0 1859778.0 2008 08/17/2015 03:03:40 PM 41.770990 -87.698901 (41.770990476, -87.698901469) Census Tract 6609 1.703166e+10
3 4788 HP618616 10/10/2008 02:33:00 AM 052XX W CHICAGO AVE 0110 HOMICIDE FIRST DEGREE MURDER RESTAURANT False False ... 01A 1141065.0 1904824.0 2008 08/17/2015 03:03:40 PM 41.894917 -87.757358 (41.894916924, -87.757358147) Census Tract 2515 1.703125e+10
4 4789 HP619020 10/10/2008 12:50:00 PM 026XX S HOMAN AVE 0110 HOMICIDE FIRST DEGREE MURDER GARAGE False True ... 01A 1154123.0 1886297.0 2008 08/17/2015 03:03:40 PM 41.843826 -87.709893 (41.843826272, -87.709893465) Census Tract 8408 1.703184e+10

5 rows × 24 columns
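The DtypeWarning above can be avoided by declaring the column's dtype on import. A minimal sketch, assuming the mixed-type column is a code column like IUCR, which mixes numeric and alphanumeric values (the exact offending column in the real files may differ):

```python
import io
import pandas as pd

# Toy CSV standing in for the processed crime files: the IUCR column mixes
# purely numeric codes with alphanumeric ones, which triggers the DtypeWarning
csv = "ID,IUCR\n1,0110\n2,031A\n"
df = pd.read_csv(io.StringIO(csv), dtype={'IUCR': str})
print(df['IUCR'].tolist())  # ['0110', '031A']
```

Reading the column as strings also preserves leading zeros that a numeric dtype would silently drop.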

A function to filter through our crime data and count how many crimes occurred in each census tract. This function also formats the information and returns it cleanly.

In [24]:
def countCrimePerTract(processed_crimes):
    # Group by tract so each GeoID stays paired with its tract name
    # (two separate value_counts() calls would not guarantee aligned ordering)
    crimePerTract = (processed_crimes.groupby(['GeoID', 'Tract'])
                                     .size()
                                     .reset_index(name='Num Crimes')
                                     .sort_values('Num Crimes', ascending=False)
                                     .reset_index(drop=True))
    crimePerTract['GeoID'] = crimePerTract['GeoID'].astype(int).astype(str)

    return crimePerTract[['GeoID', 'Tract', 'Num Crimes']]
In [25]:
crimePerTract = countCrimePerTract(processed_crimes)
crimePerTract.head()
Out[25]:
GeoID Tract Num Crimes
0 17031839100 Census Tract 8391 25695
1 17031251900 Census Tract 2519 16657
2 17031251800 Census Tract 2518 15076
3 17031420700 Census Tract 4207 14283
4 17031231200 Census Tract 2312 13203

Crime Map

In [26]:
minCrime = min(crimePerTract['Num Crimes'])
maxCrime = max(crimePerTract['Num Crimes'])

crimeChoropleth = go.Figure(go.Choroplethmapbox(geojson=census_geo_JSON, 
                                    featureidkey="properties.geoid10",
                                    locations=crimePerTract['GeoID'], 
                                    z=crimePerTract['Num Crimes'],
                                    text=crimePerTract['Num Crimes'],
                                    hoverinfo='text',
                                    colorscale="Reds", 
                                    marker_line_width=1,
                                    marker_line_color='white',
                                    marker_opacity=0.5,
                                    zmin=minCrime,
                                    zmax=maxCrime))


crimeChoropleth.update_layout(mapbox_style="light", 
                  mapbox_accesstoken='pk.eyJ1Ijoia295YXMiLCJhIjoiY2toenR4dGd6MHRpczMzbzJmYWVwcnBtNyJ9.JTZCO0J5FbFDj8OkCDIs5w',
                  mapbox_zoom=8.75, 
                  mapbox_center = {"lat": 41.864073, "lon": -87.706819},
                 )

crimeChoropleth.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
crimeChoropleth.show()

The map above shows the number of crimes committed over the course of four years within each census tract.

Mapping Socioeconomic Indicators (Hardship Index)

In [27]:
with open('data/Geo Boundaries/Boundaries - Community Areas (current).geojson') as JsonBounds:
    community_geo_JSON = json.load(JsonBounds)
    
socioeconomic_indicators = pd.read_csv('data/Income/Census_Data_-_Selected_socioeconomic_indicators_in_Chicago__2008___2012.csv')
In [28]:
socioeconomic_indicators['COMMUNITY AREA NAME'] = socioeconomic_indicators['COMMUNITY AREA NAME'].str.upper()
socioeconomic_indicators
Out[28]:
Community Area Number COMMUNITY AREA NAME PERCENT OF HOUSING CROWDED PERCENT HOUSEHOLDS BELOW POVERTY PERCENT AGED 16+ UNEMPLOYED PERCENT AGED 25+ WITHOUT HIGH SCHOOL DIPLOMA PERCENT AGED UNDER 18 OR OVER 64 PER CAPITA INCOME HARDSHIP INDEX
0 1.0 ROGERS PARK 7.7 23.6 8.7 18.2 27.5 23939 39.0
1 2.0 WEST RIDGE 7.8 17.2 8.8 20.8 38.5 23040 46.0
2 3.0 UPTOWN 3.8 24.0 8.9 11.8 22.2 35787 20.0
3 4.0 LINCOLN SQUARE 3.4 10.9 8.2 13.4 25.5 37524 17.0
4 5.0 NORTH CENTER 0.3 7.5 5.2 4.5 26.2 57123 6.0
... ... ... ... ... ... ... ... ... ...
73 74.0 MOUNT GREENWOOD 1.0 3.4 8.7 4.3 36.8 34381 16.0
74 75.0 MORGAN PARK 0.8 13.2 15.0 10.8 40.3 27149 30.0
75 76.0 O'HARE 3.6 15.4 7.1 10.9 30.3 25828 24.0
76 77.0 EDGEWATER 4.1 18.2 9.2 9.7 23.8 33385 19.0
77 NaN CHICAGO 4.7 19.7 12.9 19.5 33.5 28202 NaN

78 rows × 9 columns

In [29]:
indicatorChoropleth = go.Figure(go.Choroplethmapbox(geojson=community_geo_JSON, 
                                    featureidkey="properties.community",
                                    locations=socioeconomic_indicators['COMMUNITY AREA NAME'], 
                                    z=socioeconomic_indicators['HARDSHIP INDEX'],
                                    text=socioeconomic_indicators['HARDSHIP INDEX'],
                                    hoverinfo='text',
                                    colorscale="Reds", 
                                    marker_line_width=1,
                                    marker_line_color='black',
                                    marker_opacity=0.5,
                                    zmin=min(socioeconomic_indicators['HARDSHIP INDEX']),
                                    zmax=max(socioeconomic_indicators['HARDSHIP INDEX'])))


indicatorChoropleth.update_layout(mapbox_style="light", 
                  mapbox_accesstoken='pk.eyJ1Ijoia295YXMiLCJhIjoiY2toenR4dGd6MHRpczMzbzJmYWVwcnBtNyJ9.JTZCO0J5FbFDj8OkCDIs5w',
                  mapbox_zoom=8.75, 
                  mapbox_center = {"lat": 41.864073, "lon": -87.706819},
                 )

indicatorChoropleth.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
indicatorChoropleth.show()

Results

This project is not in a position to make any definitive claims about the nature of crime and its connection to the discussed variables. However, we can review the visualizations we made.

compiled%20maps.png

Each of the maps displays one aspect of the problem. Because the color gradients are based on the minimum and maximum values in the data, each map is a normalized view of its respective variable, helping us understand each area on a local scale as compared to the rest of the city. Another great thing about these maps is their potential to be expanded: the City of Chicago data portal has an incredible amount of diverse data to be analyzed.

Responsible Data Analysis

This section of the project is a short introduction to the moral ambiguity of data analysis. Because it is an analyst's job to try to understand a dynamic system, one function of the job is to make assumptions and generalizations in order to understand the assumed underlying patterns of data. In general, this is a great strategy that has led to the rapid success of the field, but it has moral implications when those assumptions are used to evaluate and make decisions about the wellbeing of human beings. When doing analysis of people, for people, it is important not only to acknowledge the complexity of human systems but also to be certain that all potential influencing factors are accounted for.

The following resource is a great example of how the immoral use of data analysis has been harmful in the past. This video is great for two reasons: 1) it gives examples of data analysis being misused both intentionally and unintentionally, and 2) it acknowledges the human ability to take facts out of context, despite the common misconception that numbers and logic provide unshakable evidence.

In [30]:
from IPython.display import YouTubeVideo  
YouTubeVideo("bVG2OQp6jEQ",width=640,height=360)

# https://youtu.be/bVG2OQp6jEQ
Out[30]:

The following section is a snippet of code which could have been a part of this project were it not for the precautions of responsible data analysis. It is dangerous for several reasons, and the primary one is the most subtle: displaying data in this way leads an uninformed viewer to a conclusion they think they reached on their own. The data is laid out in a way that, despite being "technically true", fails to capture the full context of the problem and vastly oversimplifies the issue it represents.

In [31]:
# Creating a DataFrame with crime and income together

tractInfo = crimePerTract

median_incomes = []
for index, row in tractInfo.iterrows():
    geoID = row['GeoID'] 
#     median_incomes.append(income_data['Household Income by Race'][income_data['Formatted ID Geography'] == geoID])
    
    ind = income_data.index[income_data['Formatted ID Geography'] == geoID].tolist()
    
    if len(ind) > 0:
        median_incomes.append(income_data['Household Income by Race'][ind[0]])
    else:
        median_incomes.append(np.nan)
    
tractInfo['Median Household Income'] = median_incomes
tractInfo.head()
Out[31]:
GeoID Tract Num Crimes Median Household Income
0 17031839100 Census Tract 8391 25695 78000.0
1 17031251900 Census Tract 2519 16657 23356.0
2 17031251800 Census Tract 2518 15076 20750.0
3 17031420700 Census Tract 4207 14283 27500.0
4 17031231200 Census Tract 2312 13203 23097.0
In [32]:
tractInfo.plot(kind='scatter', x='Num Crimes', y='Median Household Income')
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x1183d6520>
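As an aside, the row-by-row iterrows() lookup above could also be written as a single left merge; a sketch with toy stand-ins for crimePerTract and income_data (illustrative values only):

```python
import pandas as pd

# Toy stand-ins for crimePerTract and income_data (illustrative values only)
crime = pd.DataFrame({'GeoID': ['17031839100', '17031251900'],
                      'Num Crimes': [25695, 16657]})
income = pd.DataFrame({'Formatted ID Geography': ['17031251900'],
                       'Household Income by Race': [23356]})

# Left merge keeps every tract; tracts with no income match get NaN,
# mirroring the np.nan appended in the loop version
merged = crime.merge(income, left_on='GeoID',
                     right_on='Formatted ID Geography', how='left')
merged = merged.rename(columns={'Household Income by Race': 'Median Household Income'})
print(merged[['GeoID', 'Num Crimes', 'Median Household Income']])
```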

Example

Finally, a famous example of unintentional bias in data science: Survivorship bias.

Quoted from Wikipedia, "Survivorship bias or survival bias is the logical error of concentrating on the people or things that made it past some selection process and overlooking those that did not, typically because of their lack of visibility." This type of bias arises when certain variables are ignored because data points representing them cannot be collected. By analyzing a dataset that does not include a representative sample, one ignores the anomaly and, fundamentally, the group entirely.

Survivorship-bias.png

The example scenario takes place in a war. A general sends a fleet of planes on a mission to bomb a target. The fleet successfully completes the bombing, losing several planes but with a majority returning home nonetheless. Upon their return, the general has the army's data scientists analyze the returning planes to aid the engineers in upgrading them. These upgrades consist of reinforcing certain parts of each plane, armoring them to be more resistant to gunfire, with the tradeoff of making the aircraft heavier. The data scientists mark all the locations on the planes where bullet holes are found and return with the above diagram. Since bullet holes are clearly clustered around the locations with red dots, they tell the engineers to armor these places. This is where the bias arises.

The data scientists' mistake is assuming the places with bullet holes need armor. The planes that were shot in the unmarked places crashed, and are therefore not part of the dataset of returning planes. This deduction, and the bias therein, is famously called survivorship bias and is relevant to the topics discussed in this project.
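A toy simulation makes the effect concrete (the section names and the fatality rule here are illustrative assumptions, not historical data):

```python
import random

random.seed(0)

sections = ['engine', 'fuselage']
surviving_hits = {'engine': 0, 'fuselage': 0}

for _ in range(10000):
    hit = random.choice(sections)   # bullets strike both sections equally often
    if hit == 'fuselage':           # engine hits are fatal; those planes never return
        surviving_hits[hit] += 1

# Only returning planes are observed, so the engine appears untouched
print(surviving_hits)
```

Looking only at the survivors, the fuselage seems like the danger zone while the truly vulnerable engine shows zero recorded hits.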

For more specific examples of bias, view this towardsdatascience article on different types of bias.

Thank you for viewing my project!

In [ ]: